30 research outputs found

    Automatic discovery of cross-family sequence features associated with protein function

    BACKGROUND: Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed. RESULTS: We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location. 
    CONCLUSION: We have developed a novel and useful approach for knowledge discovery in annotated sequence data. The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.

    Small chromosomal regions position themselves autonomously according to their chromatin class

    The spatial arrangement of chromatin is linked to the regulation of nuclear processes. One striking aspect of nuclear organization is the spatial segregation of heterochromatic and euchromatic domains. The mechanisms of this chromatin segregation are still poorly understood. In this work, we investigated the link between the primary genomic sequence and chromatin domains. We analyzed the spatial intranuclear arrangement of a human artificial chromosome (HAC) in a xenospecific mouse background in comparison to an orthologous region of the native mouse chromosome. The two orthologous regions include segments that can be assigned to three major chromatin classes according to their gene abundance and repeat repertoire: (1) gene-rich and SINE-rich euchromatin; (2) gene-poor and LINE/LTR-rich heterochromatin; and (3) gene-depleted and satellite DNA-containing constitutive heterochromatin. We show, using fluorescence in situ hybridization (FISH) and 4C-seq technologies, that chromatin segments ranging from 0.6 to 3 Mb cluster with segments of the same chromatin class. As a consequence, the chromatin segments acquire corresponding positions in the nucleus irrespective of their chromosomal context, which strongly suggests that this positioning is an autonomous property of the segments. Interactions with the nuclear lamina, although largely retained in the HAC, reveal less autonomy. Taken together, our results suggest that the building of a functional nucleus is largely a self-organizing process based on mutual recognition of chromosome segments belonging to the major chromatin classes.

    The associations between childhood trauma, neuroticism and comorbid obsessive-compulsive symptoms in patients with psychotic disorders

    Various studies have reported remarkably high prevalence rates of obsessive-compulsive symptoms (OCS) in patients with a psychotic disorder. Little is known about the pathogenesis of this co-occurrence. The current study aimed to investigate the contribution of shared underlying risk factors, such as childhood trauma and neuroticism, to the onset and course of OCS in patients with psychosis. Data were retrieved from 161 patients with psychosis included in the 'Genetic Risk and Outcome in Psychosis' project. Patients completed measures of OCS and psychotic symptoms at study entrance and three years later. Additionally, childhood maltreatment and neuroticism were assessed. Between-group comparisons revealed increased neuroticism and positive symptoms in patients who reported comorbid OCS compared to OCS-free patients. Subsequent mediation analyses suggested a small effect of childhood abuse on comorbid OCS severity at baseline, which was mediated by positive symptom severity. Additionally, results showed a mediating effect of neuroticism as well as a moderating effect of positive symptoms on the course of OCS severity over time. OCS severity in patients with psychosis might thus be associated with common vulnerability factors, such as childhood abuse and neuroticism. Furthermore, the severity of positive symptoms might be associated with more severe or persistent comorbid OCS.

    Automatic discovery of cross-family sequence features associated with protein function-1

    Copyright information: Taken from "Automatic discovery of cross-family sequence features associated with protein function". BMC Bioinformatics 2006;7:16. Published online 12 Jan 2006. PMCID: PMC1395344. Copyright © 2006 Brameier et al; licensee BioMed Central Ltd.
    d classifiers are shown in panels A, C & E. In panels B, D & F, the correlation between the sequence-based classification and the annotation-based classification is shown for both training and testing data during the 8 h runs which produced the final individuals shown in panels A, C & E. Although these are hand-picked examples, note how the test set correlation generally follows the training set correlation in an upward trend. Because the test set proteins are minimally related to the training set proteins (less than 10% sequence identity), this shows that general sequence features related to function have been discovered.

    Automatic discovery of cross-family sequence features associated with protein function-2

    etic programming approach. In this figure, an 8 × 8 Self-Organizing Map (SOM) is used to cluster the predictors based on the pattern of sequence-based test set classifications. Predictors which classify similar subsets of the sequences will be localised to the same region of the map. Each SOM node is annotated as follows (the example used is at row 3, column 2): the number of "A-type" and "B-type" predictors which map to this node (e.g. "4A + 2B"); the common target words for the annotation-based classifier and their frequencies (e.g. "2 biosynthesis, 2 mitochondrial"); the inset boxes show which annotation words are over-represented in the test set sequences which are positively classified by the sequence-based classifier (e.g. "oxidised"). See Methods for detailed information.

    Automatic discovery of cross-family sequence features associated with protein function-0

    ds (to the right). The evolutionary search produces two independent classifiers which act on the two types of information. Fictional examples of these classifiers are shown. Two binary vectors are produced from the application of these classifiers to their respective inputs. Ideally, a pair of classifiers would produce identical (non-trivial) binary vectors. The goal of the evolutionary search is to maximise the correlation between these vectors.
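The objective described in this caption can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the four toy proteins, the regular expression, and the keyword rule are invented for illustration and are not classifiers from the paper, and Pearson correlation on 0/1 vectors is assumed as the correlation measure.

```python
import re
from math import sqrt

# Hypothetical toy data: (amino acid sequence, annotation keywords).
proteins = [
    ("MKRKRSAQQQQLT", {"nuclear", "transcription"}),
    ("MSTKKKRPLEDQQQ", {"nuclear"}),
    ("MALWTRLLPLLALL", {"secreted", "signal"}),
    ("MGSSHAAGFVYW", {"cytoplasmic", "biosynthesis"}),
]

def sequence_classifier(seq):
    """Fictional sequence-based classifier: fires on a run of three K/R residues."""
    return 1 if re.search(r"[KR]{3}", seq) else 0

def annotation_classifier(keywords):
    """Fictional keyword-based logical expression: nuclear AND NOT secreted."""
    return 1 if "nuclear" in keywords and "secreted" not in keywords else 0

def correlation(x, y):
    """Pearson correlation of two equal-length 0/1 vectors.

    Returns 0.0 when either vector is constant (a trivial classifier).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

seq_vec = [sequence_classifier(seq) for seq, _ in proteins]
ann_vec = [annotation_classifier(kw) for _, kw in proteins]
fitness = correlation(seq_vec, ann_vec)
```

Scoring constant vectors as 0.0 is one simple way to keep a search from rewarding trivial classifiers that label every protein the same way; how the original method handles that case is not stated in this listing.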

    Automatic discovery of cross-family sequence features associated with protein function-5

    ved regular expressions which may influence the classifier positively or negatively. As described in the Methods, this positive or negative influence can be determined with an approximation method. The positively influencing regular expressions are matched against test set sequences (cuts 1 to 4 of the data individually, or pooled together, indicated with "All" in the figure). The 500 most-matched residues or sequence fragments are then analysed manually for recurrent patterns. In panel A, we summarise the sequence features that are important for predictors of the functions "nuclear", "transcription" and "DNA". As expected, sequence features containing multiple lysine and arginine residues are an important signal in nuclear proteins (the pattern [KR]{3} is found in approximately 15% of the top 500 positively influencing residues for "nuclear" predictors). Other signals thought to be involved in protein-protein interactions in the nucleus are also identified by this analysis: repeated acidic residues and polyglutamine. The polyglutamine feature, and particularly polyglutamine flanked by at least one of the residues D/R/H/A/N/K/S/L/E/P/T, is a stronger signal for "transcription" predictors. In panel B, the same analysis is performed for predictors of "cytoplasmic", "biosynthesis" and "catalyzes". In this case only single-residue "features" are apparent from the data. For instance, aromatic residues are more important for predictors of "biosynthesis" and "catalyzes" than for "cytoplasmic" (green bars).
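As a rough illustration of the pattern counting described in this caption, the sketch below measures what fraction of a set of sequence fragments contains a given motif. The fragments are invented, the helper name `fraction_matching` is hypothetical, and the minimum polyglutamine length of three is an assumption, since the caption does not state one.

```python
import re

# Motif named in the caption: three consecutive lysine/arginine residues.
BASIC_CLUSTER = re.compile(r"[KR]{3}")

# Flanking residues listed in the caption for the polyglutamine feature.
FLANK = "DRHANKSLEPT"
POLYQ_MIN = 3  # assumption: the caption gives no minimum repeat length

# Polyglutamine run with a listed residue directly before or after it.
FLANKED_POLYQ = re.compile(
    rf"[{FLANK}]Q{{{POLYQ_MIN},}}|Q{{{POLYQ_MIN},}}[{FLANK}]"
)

def fraction_matching(pattern, fragments):
    """Fraction of fragments containing at least one match of the pattern."""
    if not fragments:
        return 0.0
    hits = sum(1 for fragment in fragments if pattern.search(fragment))
    return hits / len(fragments)

# Hypothetical fragments standing in for the most-matched test set fragments.
fragments = ["PKKKRKV", "AQQQQS", "GGSGG", "KRKAQQQ", "QQQ"]

basic_fraction = fraction_matching(BASIC_CLUSTER, fragments)
polyq_fraction = fraction_matching(FLANKED_POLYQ, fragments)
```

The bare fragment "QQQ" is not counted by `FLANKED_POLYQ` because it has no flanking residue, which mirrors the distinction the caption draws between plain and flanked polyglutamine.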

    Automatic discovery of cross-family sequence features associated with protein function-4

    over 537 test set sequences are shown in scatter plot form. The red points show scores for two identical but independently trained "nuclear" predictors. As expected, a strong correlation exists between the scores of these two predictors. The blue points show scores from a "nuclear" predictor plotted against the scores from a "transcription" predictor. The scores are still quite well correlated, but the distribution of points mainly below the diagonal suggests that proteins that get high scores for "nuclear" do not always have equally high scores for "transcription", which agrees with general observations that not all nuclear proteins are involved in transcription (but all transcription proteins are nuclear). In panel B, accuracy vs. coverage plots are shown for the four combinations of predictors trained and/or tested on "nuclear" and/or "transcription". The data shown here are for the pooled test set proteins from a four-fold cross-validation experiment. The noteworthy result here is the increased performance of the "transcription"-trained predictor (blue line) compared to the "nuclear"-trained predictor (magenta line) when predicting "transcription". Panels C & D show the equivalent data for "secreted" vs. "inhibits" predictors. Panels E & F show the data for "cytoplasmic" vs. "biosynthesis" predictors.
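One common way to build an accuracy vs. coverage curve from predictor scores is sketched below. The definitions are assumptions, not taken from the paper: accuracy is treated as the fraction of proteins above a score cutoff that truly carry the annotation, and coverage as the fraction of all annotated proteins that fall above the cutoff. The scores and labels are invented.

```python
def accuracy_coverage_curve(scores, labels):
    """Return (coverage, accuracy) pairs, one per score cutoff.

    Proteins are ranked by descending score. At rank k, the top k proteins
    are treated as positive predictions. Tied scores are not grouped here.
    """
    total_positives = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    curve = []
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        accuracy = true_positives / k
        coverage = true_positives / total_positives if total_positives else 0.0
        curve.append((coverage, accuracy))
    return curve

# Hypothetical predictor scores and true annotation labels (1 = annotated).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = accuracy_coverage_curve(scores, labels)
```

Comparing two predictors on the same labels, as the caption does for the "transcription"-trained and "nuclear"-trained predictors, amounts to computing one such curve per predictor and plotting them together.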